Abstract

mlpy is an open source Python machine learning library built on top of NumPy/SciPy
and the GNU Scientific Library. It provides a wide range of state-of-the-art machine
learning methods for supervised and unsupervised problems, and it aims at a reasonable
compromise among modularity, maintainability, reproducibility, usability and efficiency.
mlpy is multiplatform, works with Python 2 and 3, and is distributed under GPL3 at the
website http://mlpy.fbk.eu
1. Overview

We introduce here mlpy, a library providing access to a wide spectrum of machine learning
methods implemented in Python, which has proven to be an effective environment for
building scientifically oriented tools (Perez et al., 2011). Although planned for general
purpose applications, mlpy has computational biology in general, and functional genomics
modeling in particular, as its elective application fields. As a major application example,
we use mlpy methods to implement molecular profiling experiments that need to warrant
study reproducibility (Ioannidis et al., 2009) and flawless results (Ambroise and McLachlan,
2002). This task requires the availability of highly modular tools allowing practitioners
to build an adequate workflow for the task at hand following authoritative guidelines
(The MicroArray Quality Control (MAQC) Consortium, 2010). Such a workflow involves a
complex sequence of steps, both in the development and in the validation phases, starting
from the upstream preprocessing algorithms to the downstream predictive analysis,
repeated several times to accommodate the resampling schema. The dimension of the
high-throughput data involved (thousands of samples described by millions of features) and
the large number of replicates needed to control bias effects also make efficiency an
essential requirement.

mlpy is aimed at reaching a good compromise among code modularity, usability and
efficiency. In this spirit, mlpy finds a different equilibrium among these characteristics,
being more inclined towards flexibility than similar projects such as scikits-learn
(Pedregosa et al., 2011), PyMVPA (Hanke et al., 2009), PyBrain (Schaul et al., 2010),
MDP (Zito et al., 2009) and Shogun (Sonnenburg et al., 2010). In some areas the set of
provided tools is among the most complete (e.g., wavelets) or even the only one to be found
(Canberra indicator for feature list stability). In particular, mlpy supplies the biologist
researcher with state-of-the-art implementations of many well-known algorithms, with
attention to novel methods appearing in the literature, and the bioinformatician more
inclined to programming with a modular environment in which to embed their favourite
methods. However, mlpy usage is not confined to bioinformatics: applications to computer
vision, emotion detection, seismology and ethology have been published in the literature
(see footnote 1).

mlpy works on Python 2 and 3 and it is available for Linux, Mac OS X and Microsoft
Windows (XP, Vista, 7) platforms, under the GPL3 licence. User documentation is written in
Sphinx and is available either online or as a downloadable manual in PDF format. Together
with the library description, a tutorial with several examples is provided as part of the
documentation. By design, separate documentation on API references is not needed; however,
support for both final users and developers is offered by means of a dedicated mailing list
at the website http://groups.google.com/group/mlpy-general. mlpy has been listed in the
Machine Learning Open Source Software (MLOSS) repository (see footnote 2) since February
2008.



2. Background and Requirements 

mlpy is built on top of the NumPy/SciPy packages and the GNU Scientific Library (GSL), and
it makes extensive use of the Cython language (see footnote 3): these are prerequisites for
the library installation. The NumPy and SciPy modules provide a sophisticated N-dimensional
array object and basic linear algebra functions, and collect a variety of high-level
algorithms for science and engineering. The GNU Scientific Library (GSL) is a well-known
library for numerical calculations written in C. Cython is a language very close to Python
that allows generating very efficient C code and wrapping external C/C++ libraries. mlpy
includes an efficient Cython wrapper for the LibSVM (Chang and Lin, 2011) and LibLinear
(Fan et al., 2008) C++ libraries. These implementations are the reference for Support
Vector Machines and large-scale linear classification, respectively. mlpy is fully
compatible with PyInstaller (see footnote 4), a software that converts Python packages and
scripts into stand-alone executables for several platforms.



3. Library Features 

The core of the library consists of a number of classical and more recent algorithms for
classification, regression and dimensionality reduction, such as methods from the Support
Vector Machines (SVM) and the Discriminant Analysis families, and their (mostly
kernel-based) variants. In particular, the set of classifiers includes Linear and Kernel
SVM, Linear Discriminant Analysis (LDA), Diagonal Linear Discriminant Analysis (DLDA), Basic
Perceptron, Logistic Regression, Elastic Net, Golub Classifier, Kernel Fisher Discriminant
Analysis (KFDA), Parzen-based, k-nearest neighbor (KNN), classification tree and Maximum
Likelihood. The implemented regressors are Ordinary (Linear) Least Squares, Linear and
Kernel Ridge, Partial Least Squares, LARS, Elastic Net, and Linear and Kernel SVM. Finally,
Fisher Discriminant Analysis (FDA), Kernel FDA, Spectral Regression Discriminant Analysis
(SRDA), Principal Component Analysis (PCA) and Kernel PCA are the implemented
dimensionality reduction algorithms.

1. http://mlpy.sf.net/refs

2. http://mloss.org

3. http://www.cython.org/

4. http://www.pyinstaller.org/

Default values are provided for each classifier's parameters. Distinct methods are deployed
for the training (learn()), the testing (pred()) for classification and regression, and the
projection (transform()) for the dimensionality reduction algorithms. Whenever possible,
functions are provided to access model parameters (for example, hyperplane coefficients or
transformation matrix) and other algorithm-specific information. Kernel-based functions are
managed through a common kernel layer. In particular, the user can choose whether to supply
either the data or a precomputed kernel in input space: linear, polynomial, Gaussian,
exponential and sigmoid kernels are available as default choices, and custom kernels can be
defined as well.

Many classification and regression algorithms are endowed with an internal feature ranking
procedure; as an alternative, mlpy implements the I-Relief algorithm. Recursive Feature
Elimination (RFE) for linear classifiers and the KFDA-RFE algorithm are available for
feature selection. Methods for feature list analysis (for example, the Canberra stability
indicator (Jurman et al., 2008)), data resampling and error evaluation are provided,
together with different clustering analysis methods (Hierarchical, Memory-saving
Hierarchical, k-means). Finally, dedicated submodules are included for longitudinal data
analysis through wavelet transform (Continuous, Discrete and Undecimated) and dynamic
programming algorithms (Dynamic Time Warping and variants).
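
To make the idea of a precomputed kernel more concrete, the following is a minimal
pure-Python sketch of three of the kernel families named above and of the kernel (Gram)
matrix built from them. It is not mlpy's kernel layer: the function names, the parameter
names (gamma, b, d, sigma) and the particular Gaussian parametrization are assumptions
chosen for illustration only.

```python
import math


def linear_kernel(x, z):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, gamma=1.0, b=1.0, d=2):
    """(gamma * <x, z> + b) ** d; the parameter names are illustrative."""
    return (gamma * linear_kernel(x, z) + b) ** d


def gaussian_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 * sigma^2)); one common parametrization (assumed)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_matrix(kernel, rows, cols):
    """Matrix K with K[i][j] = kernel(rows[i], cols[j])."""
    return [[kernel(x, z) for z in cols] for x in rows]


# Three toy samples with two features each (made-up data).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K_lin = kernel_matrix(linear_kernel, X, X)
K_rbf = kernel_matrix(gaussian_kernel, X, X)
```

A kernel method that accepts a precomputed kernel would be given a matrix such as K_lin or
K_rbf in place of X; a custom kernel is simply another function with the same two-vector
signature.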

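As an illustration of the dynamic programming algorithms mentioned above, the sketch below
implements the classic Dynamic Time Warping recurrence for two one-dimensional sequences in
plain Python. It is a textbook version under stated assumptions (absolute difference as
local cost, no window constraint, no normalization) and is not the mlpy implementation; the
name dtw_distance is hypothetical.

```python
def dtw_distance(a, b):
    """Cumulative DTW cost between two 1-d sequences (absolute-difference local cost)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j],      # a[i-1] matched again
                                     cost[i][j - 1],      # b[j-1] matched again
                                     cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m]


# series_b is series_a with its first value repeated, so warping aligns them exactly.
series_a = [0.0, 1.0, 2.0, 1.0, 0.0]
series_b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
print(dtw_distance(series_a, series_b))  # 0.0
```

The full cost table takes O(n*m) time and memory; constrained or memory-saving variants
restrict which cells of the table are filled.
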
Example. As a working example illustrating the use of the library in a simple machine
learning task, we report the lines of code needed to perform a PCA followed by an SVM
classification. In particular, we detail the operational steps needed to project the samples
of a UCI dataset (see footnote 5) on the Cartesian plane generated by the first two
principal components, to train a kernel SVM on the projected data and to test the trained
model on the same data. The dataset chosen for this dimensionality reduction example is the
Iris dataset, collecting 150 observations of 3 different classes of iris flowers, each
described by 4 attributes.

>>> iris.shape # 2d numpy array, 150 observations and 4 attributes
(150, 4)
>>> import mlpy # import the mlpy module
>>> pca = mlpy.PCA() # build a new PCA instance
>>> pca.learn(iris) # perform the PCA on the Iris dataset
>>> iris_pc = pca.transform(iris, k=2) # project Iris on the first 2 PCs
>>> svm = mlpy.LibSvm(kernel_type='linear') # build a new LibSVM instance
>>> svm.learn(iris_pc, labels) # train the model
>>> labels_pred = svm.pred(iris_pc) # test the model
>>> mlpy.error(labels, labels_pred) # compute the prediction error
0.033
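
The reported value can be read as a misclassification fraction: 0.033 is consistent with 5
of the 150 samples being predicted wrongly (5/150 = 0.0333...). The sketch below reproduces
that arithmetic in plain Python, assuming the error is the fraction of samples whose
predicted label differs from the true one. The function name error_rate, the label values
and the positions of the wrong predictions are made up for illustration and are not taken
from the actual experiment.

```python
def error_rate(true_labels, predicted_labels):
    """Fraction of positions where the predicted label differs from the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("label sequences must not be empty")
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return wrong / float(len(true_labels))


# Hypothetical labels: 150 samples, 3 classes of 50 samples each.
true_labels = [1] * 50 + [2] * 50 + [3] * 50
predicted = list(true_labels)
# Flip 5 predictions between classes 2 and 3 (positions chosen arbitrarily).
for idx in (70, 72, 77, 83, 106):
    predicted[idx] = 3 if true_labels[idx] == 2 else 2

print(round(error_rate(true_labels, predicted), 3))  # 0.033
```

Since the model is tested on the same data used for training, this figure is a
resubstitution error rather than an estimate of generalization performance.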



5. http://archive.ics.uci.edu/ml/datasets.html